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Abstract 



Penalized regression is an attractive framework for variable selection problems. Often, vari- 
ables possess a grouping structure, and the relevant selection problem is that of selecting groups, 
not individual variables. The group lasso has been proposed as a way of extending the ideas of 
the lasso to the problem of group selection. Nonconvex penalties such as SCAD and MCP have 
been proposed and shown to have several advantages over the lasso; these penalties may also 
be extended to the group selection problem, giving rise to group SCAD and group MCP meth- 
ods. Here, we describe algorithms for fitting these models stably and efficiently. In addition, 
we present simulation results and real data examples comparing and contrasting the statistical 
properties of these methods. 

1 Introduction 

In regression modeling, explanatory variables can often be thought of as grouped. To represent 
a categorical variable, we may introduce a group of indicator functions. To allow flexible modeling 
of the effect of a continuous variable, we may introduce a series of basis functions. Or the variables 
may simply be grouped because the analyst considers them to be similar in some way, or because a 
scientific understanding of the problem implies that a group of covariates will have similar effects. 

Taking this grouping information into account in the modeling process should improve both the 
interpretability and the accuracy of the model. These gains are likely to be particularly important 
in high-dimensional settings where sparsity and variable selection play important roles in estimation 
accuracy. 

Penalized likelihood methods for coefficient estimation and variable selection have become 
widespread since the original proposal of the lasso (Tibshirani, 1996). Building off of earlier work 
by Bakin (1999), Yuan and Lin (2006) extended the ideas of penalized regression to the prob- 
lem of grouped covariates. Rather than penalizing individual covariates, Yuan and Lin proposed 
penalizing norms of groups of coefficients, and called their method the group lasso. 

The group lasso, however, suffers from the same drawbacks as the lasso. Namely, it generally 
does not have the selection consistency property and tends to over-shrink large coefficients. This 
is because the rate of penalization of the group lasso does not change with the magnitude of the 
group coefficients, which leads to biased estimates of large coefficients. To compensate for the 
over-shrinkage, the group lasso tends to include spurious coefficients into the model. 
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The smoothly clipped absolute deviation (SCAD) penalty and the minimax concave penalty 
(MCP) were developed in an effort to achieve what the lasso could not: simultaneously selection 
consistency and asymptotic unbiasedness (Fan and Li, 2001; Zhang, 2010). This achievement 
is known as the oracle property, so named because it implies that the model is asymptotically 
equivalent to the fit of a maximum likelihood model in which the identities of the truly nonzero 
coefficients are known in advance. These properties extend to group SCAD and group MCP models, 
as shown in Wang et al. (2008) and Huang et al. (to appear). 

However, group SCAD and group MCP have not been widely used or studied in comparison with 
the group lasso, largely due to a lack of efficient and publicly available algorithms for fitting these 
models. Published articles on the group SCAD (Wang et al., 2007, 2008) have used a local quadratic 
approximation for fitting these models. The local quadratic approximation was originally proposed 
by Fan and Li (2001) to fit SCAD models. However, by relying on a quadratic approximation, the 
approach is incapable of producing naturally sparse estimates, and therefore cannot take advantage 
of the computational benefits provided by sparsity. This, combined with the fact that solving the 
local quadratic approximation problem requires the repeated factorization of large matrices, makes 
the algorithm very inefficient for fitting large regression problems. Zou and Li (2008) proposed 
a local linear approximation for fitting SCAD models and demonstrated its superior efficiency to 
local quadratic approximations. This algorithm was further improved upon by Breheny and Huang 
(2011), who demonstrated how a coordinate descent approach may used to fit SCAD and MCP 
models in a very efficient manner capable of scaling up to deal with very large problems. 

Here, we show how the approach of Breheny and Huang (2011) may be extended to fit group 
SCAD and group MCP models. We demonstrate that this algorithm is very fast and stable, and we 
provide a publicly available implementation in the grpreg package (http : / / cran . r-pro j ect . org/ 
web/packages/grpreg/index.html. In addition, we provide examples involving both simulated 
and real data which demonstrate the potential advantages of group SCAD and group MCP over 
the group lasso. 

2 Group descent algorithms 

We consider models in which the relationship between the outcome and the explanatory variables 
is specified in terms of a linear predictor r/: 



where Xj is the portion of the design matrix formed by the predictors in the j'th group and the vector 
(3j consists of the associated regression coefficients. Letting Kj denote the number of members in 
group j, ~Kj is an n x Kj matrix with elements (xijk), the value of kth. covariate in the jth group 
for the ith. subject. Covariates that do not belong to any group may be thought of as a group of 
one. 

The problem of interest involves estimating a vector of coefficients (3 using a loss function L 
which quantifies the discrepancy yi and r/j combined with a penalty that encourages sparsity and 
prevents overfitting; specifically, we estimate f3 by minimizing 



J 




(2.1) 




(2.2) 
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where p(-) is a penalty function applied to the Euclidean norm (L2 norm) of (3j. The penalty is 
indexed by a regularization parameter A, which controls the tradeoff between loss and penalty. It 
is not necessary for A to be the same for each group; i.e., we may consider a collection of regularity 
parameters {Aj}. For example, in practice there are often variables known to be related to the 
outcome and therefore which we do not wish to include in selection or penalization. The above 
framework and algorithms which follow may be easily extended to include such variables by setting 
A, =0. 

In this section, we primarily focus on linear regression, where E(y) = r] and L is the least 
squares loss function, but take up the issue of logistic regression in Section 2.4, where L arises from 
the binomial likelihood. To ensure that the penalty is invariant to scale, covariates are standardized 
prior to fitting such that Yli x ijk = and n^ 1 J2i x %k = 1- As emphasized in Simon and Tibshirani 
(2011), this standardization is just as important for the group lasso as it is for the lasso, and is a 
separate issue from that of orthogonalization (discussed in Section 2.1). We assume without loss 
of generality that the covariates are standardized in this way during the model fitting process and 
then transformed back to the original scale once all models have been fit. For linear regression, 
we also assume, again without loss of generality, that the response has been centered such that 
J2i Hi = 0; m this case A) = an d may be ignored. 

2.1 Orthogonalization 

The algorithm for fitting group lasso models originally proposed by Yuan and Lin (2006) requires 
the groups Xj to be orthonormal, as does the extension to logistic regression proposed in Meier et al. 
(2008). It is somewhat unclear from these papers, however, whether orthogonality is a necessary 
aspect of problems that can be solved by the group lasso, or whether it may be introduced without 
loss of generality for the sake of model fitting. 

We attempt to clarify the issue of orthonormality here, and demonstrate that groups may be 
made orthonormal to aid in model fitting without any loss of generality, an issue is also discussed in 
Simon and Tibshirani (2011). This considerably extends the range of designs that can be analyzed 
with the group lasso. For example, a group of indicator variables is orthonormal only under a 
balanced design having an equal number of observations in each category. Our approach requires 
no such balance, however. 

In this article, we develop algorithms dealing with the common scenario where the analyst (a) 
does not necessarily have orthogonal groups {Xj}, (b) wants solutions on the scale and units of 
the original explanatory variables, and (c) does not wish to deal with or think about any issues of 
orthonormality in fitting the model. In this section, we show that these goals may be accomplished 
by orthogonalizing Xj using a Cholesky decomposition, fitting the model, and then using the 
decomposition to transform the solution back to the original scale. 

Taking the Cholesky decomposition of the Gram matrix of the jth group, we have 

-XjXj = UjUj, 

where U is an upper triangular matrix. Now, we may construct a linear transformation Xj = 
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XjUj with the following properties: 

1 ~ T ~ 
-XjX 7 - = I 

n J 

where I is the identity matrix and (3j is the solution to (2.2) on the orthogonalized scale. In other 
words, if we have the solution to the orthogonalized problem, we may easily transform back to the 
original problem with fij = XJJ 1 ^. This procedure is not terribly expensive from a computational 
standpoint, as the decompositions are being applied only to the groups, not the entire design 
matrix, and the inverses are easily obtained due to the triangular structure of TJj. Furthermore, 
the decompositions need only be computed once, not with every iteration. 

For the remainder of this article, we will assume that this process has been applied and that 
the model fitting algorithms we describe are being applied to orthogonal groups, and we will drop 
the tildes on X and (3 in the rest of the article. It is worth clarifying that by "orthogonal groups", 
we mean groups for which n _1 XjXj = I, not that groups Xj and X& are orthogonal to each other. 

Again, we note that this orthonormalization may be accomplished without loss of generality 
- we are still solving the original minimization problem, just with a change of variables to aid in 
the model fitting process. Furthermore, the issue of orthogonalization is entirely separate from the 
issue of standardization (requiring that n^ 1 J2i x 'ijk = Both standardized and unstandardized 
groups Xj may be orthogonalized with the approach described above. Unlike orthonormalization, 
standardization does change the minimization problem and results in different estimates of (3. 
However, the standardized solution, being equivariant to changes of scale, is preferred. 

Finally, it is worth noting that the Cholesky decomposition requires all Xj to be full-rank 
matrices. If this is not the case, then either the redundant columns of Xj must be removed prior 
to model fitting or a ridge approach must be taken, as in Simon and Tibshirani (2011). Which 
approach is preferable depends on context and the goals of the analysis. 

2.2 Group lasso 

In this section, we describe the group lasso and algorithms for fitting group lasso models. The 
group lasso estimator, originally proposed by Yuan and Lin (2006), is defined as the value j3 that 
minimizes 

Q(P) = ^ lly - X/3|| 2 + A £ y/Kj \\f3jW . (2.3) 

j 

The idea behind the penalty is to apply a lasso penalty to the Euclidean (L2) norm of each group, 
thereby encouraging sparsity and variable selection at the group level. The solution j3 has the 
property that if group j is selected, then j3jk 7^ for all k, otherwise fijk = for all k. The 
magnitude of \\/3j\\ will tend to be larger if /3 • has more elements; the presence of the y^Kj term 
counteracts this tendency, thereby normalizing for group size. As discussed in Simon and Tibshirani 
(2011), this results in variable selection which is roughly equivalent to the uniformly most powerful 
invariant test for inclusion of the jth group. In what follows, we will absorb the y^Kj term into A 
and use Xj = X^fWj. 
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Figure 1: The impact of orthogonalization on the solution to the group lasso. Contour lines for 
the likelihood (least squares) surface are drawn, centered around the OLS solution, as well as the 
solution path for the group lasso as A goes from to oo. Left: Non-orthogonal X. Right: Orthogonal 
X. 



Yuan and Lin (2006) also propose an algorithm which they base on the "shooting algorithm" 
of Fu (1998). Here, we refer to this type of algorithm as a "group descent" algorithm. The idea 
behind the algorithm is the same as that of coordinate descent algorithms (Friedman et al., 2007; 
Wu and Lange, 2008), with the modification that the optimization of (2.3) takes place repeatedly 
with respect to a group f3j rather than an individual coordinate j3j. 

Below, we present the group descent algorithm for solving (2.3) to obtain the group lasso 
estimator. The algorithm is essentially the same as Yuan and Lin's, although (a) we have generalized 
it to the case of non-orthogonal groups using the approach described in Section 2.1, (b) we restate 
the algorithm to more clearly illustrate the connections with coordinate descent algorithms, and (c) 
we employ techniques developed in the coordinate descent literature to speed up the implementation 
of the algorithm considerably. As we will see, this presentation of the algorithm also makes clear 
how to easily extend it to fit group SCAD and group MCP models in the following sections. 

We begin by noting that the subdifferential (Bertsekas, 1999) of Q with respect to (3j is given 

by 

dQ(Pj) = - Zj + p d + Xjs(3j/ \\PjW , (2.4) 

where Zj = Xj(y — ~K-j/3_j) is the least squares solution, X_j is the portion of the design matrix 
that remains after Xj has been excluded, (3_j are its associated regression coefficients, and s = 1 
for (3j 7^ and s £ [—1, 1] otherwise. The main implication of (2.4) is that, by orthogonalizing the 
groups, Xj drops out of the equation and the multivariate problem of optimizing with respect to 
(3j is reduced to a univariate problem, as the solution must lie on the line segment joining and zj. 
The geometry of this problem is illustrated in Figure 1. As the figure illustrates, orthogonalization 
renders the direction of j3j invariant with respect to A, thereby enabling us to break the solution 
down into two components: determining the direction of /3„- and determining its length. 
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Furthermore, determining the length of (3j is equivalent to solving a one-dimensional lasso 
problem, which has a simple, closed- form solution given by the soft-thresholding operator (Donoho 
and Johnstone, 1994): 



S(z,X) 



z — A if z > X 

if \z\ < A (2.5) 
z + A if z < -A. 



With a slight abuse of notation, we extend this definition to a vector-valued argument z as follows: 



S(z,X) = S(\\z\\,X)^, (2.6) 

where zj ||z|| is the unit vector in the direction of z. In other words, S(z, A) acts on a vector z by 
shortening it towards by an amount A, and if the length of z is less than A, the vector is shortened 
all the way to 0. 

The multivariate soft-thresholding operator is the solution to the single-group group lasso, just 
as the univariate soft-thresholding operator is the solution to the single-variable lasso problem. 
This leads to Algorithm 1 , which is exactly the same as the coordinate descent algorithm described 
in Friedman et al. (2010b) for fitting lasso-penalized regression models, only with multivariate 
soft-thresholding replacing univariate soft-thresholding. Both algorithms are essentially modified 
backfitting algorithms, with soft-thresholding replacing the usual least squares updating. 

Algorithm 1 Group descent algorithm for the group lasso 
repeat 

for j = l,2,...,J 

r'^r-Xj 0^.-/3,) 
until convergence 



In Algorithm 1, f3j refers to the current (i.e., most recently updated) value of coefficients in the 
jth group prior to the execution of the for loop; during the loop, (3j is updated to (3'j. The same 
notation is applied to r, where r refers to the residuals: r = y — J2j ^-jPj- The " refers to the 
fact that f3j and r are being continually updated; at convergence, (3 consists of the final updates 
{/3j}. The expression zj = Xjr + (3j is derived from 

zj = Xj(y - X^/3^.) = Xjr + fy; (2.7) 

the algorithm is implemented in this manner because it is more efficient computationally to update 
r than to repeatedly calculate the partial residuals y — ~K_j/3_j, especially in high dimensions. 

The computational efficiency of Algorithm 1 is clear: no complicated numerical optimization 
steps or matrix factorizations or inversions are required, only a small number of simple arithmetic 
operations. This efficiency is possible only because the groups {Xj} are made to be orthogonal prior 
to model fitting. Without this up-front orthogonalization, we cannot obtain the simple closed-form 
solution (2.6), and the updating steps required to fit the group lasso become considerably more 
complicated, as in Friedman et al. (2010a) and Foygel and Drton (2010). 
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2.3 Group MCP and group SCAD 

We have just seen how the group lasso may be viewed as applying the lasso/soft-thresholding 
operator to the length of each group. Not only does this formulation lead to a very efficient 
algorithm, it also makes it clear how to extend other univariate penalties to the group setting. 
Here, we focus on two popular alternative univariate penalties to the lasso: SCAD, the smoothly 
clipped absolute deviation penalty (Fan and Li, 2001) and MCP, the minimax concave penalty 
(Zhang, 2010). 

The two penalties are similar in motivation, definition, and performance. The penalties are 
defined on [0, oo) for A > as follows, and plotted on the left side of Figure 2: 



SCAD: p Xtl {6) 



MCP: p A>7 (0) 



\9 

7A6>-0.5(6> 2 +A 2 ) 

7-1 
A 2 (7 2 ~l) 
2(7-1) 

■g iffl< 7 A 

hx 2 if e > 7 a 



if < A 
if A < 6 < 7 A 
if 9 > 7 A 



(2.8) 



(2.9) 



To have a well-defined minimum, we must have 7 > 1 for MCP and 7 > 2 for SCAD. Although 
originally proposed for univariate penalized regression, these penalties may be extended to the 
grouped- variable selection problem by substituting (2.8) and (2.9) into (2.2), as has been proposed 
in Wang et al. (2007) and discussed in Huang et al. (to appear). We refer to these penalized 
regression models as the Group SCAD and Group MCP methods, respectively. 

The rationale behind the penalties can be understood by considering their derivatives, which 
appear in the middle panel of Figure 2. MCP and SCAD begin by applying the same rate of 
penalization as the lasso, but continuously relax that penalization until the point at which 9 = 7A, 
where the rate of penalization has fallen all the way to 0. The aim of both penalties is to achieve 
the variable selection properties of the lasso, but to introduce less bias towards zero among the 
true nonzero coefficients. The only difference between the two is that MCP reduces the rate of 
penalization immediately, while SCAD remains flat for a while before moving towards zero. 

The rationale behind the penalties can also be understood by considering their univariate so- 
lutions. Consider the simple linear regression of y upon x, with unpenalized least squares solution 



n 



x y. For this simple linear regression problem, the MCP and SCAD estimators have the 



following closed forms: 



P = F(z,\,>y) 



S(z,X) 
1-1/7 



P = Fs(z,\,j) = { 



if \z\ < 7A 
z if \z\ > 7A, 

'S(z,X) if \z\ < 2A 

S ~W^T if2A<M< 7 A 
z if \z\ > 7A. 



(2.10) 
(2.11) 



Noting that S(z, A) is the univariate solution to the lasso, we can observe by comparison that MCP 
and SCAD scale the lasso solution upwards toward the unpenalized solution by an amount that 
depends on 7. For both MCP and SCAD, when \z\ > 7A, the solution is scaled up fully to the 
unpenalized least squares solution. These solutions are plotted in the right panel of Figure 2; the 
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Figure 2: Lasso, SCAD, and MCP penalty functions, derivatives, and univariate solutions. The 
panel on the left plots the penalties themselves, the middle panel plots the first derivative of 
the penalty, and the right panel plots the univariate solutions as a function of the ordinary least 
squares estimate. The light gray line in the rightmost plot is the identity line. Note that none of 
the penalties are differentiable at f)j = 0. 



figure illustrates how the solutions, as a function of z, make a smooth transition between the lasso 
and least squares solutions. This transition is essential to the oracle property described in the 
introduction. 

As 7 — > oo, the MCP and lasso solutions are identical. As 7 — > 1, the MCP solution becomes 
the hard thresholding estimate zlw^x- Thus, in the univariate sense, the MCP produces the "firm 
shrinkage" estimator of Gao and Bruce (1997); hence the F(-) notation. The SCAD solutions 
are similar, of course, but not identical, and thus involve a "SCAD-modified firm thresholding" 
operator which we denote Fg(-). In particular, the SCAD solutions also have soft-thresholding as 
the limiting case when 7 — > 00, but do not have hard thresholding as the limiting case when 7—7-2. 

We extend these two firm-thresholding operators to multivariate arguments as in (2.6), with F(-) 
or Fs(-) taking the place of S(-), and note that F(zj,X, 7) and Fs(%j,\, 7) optimize the objective 
functions for Group MCP and Group SCAD, respectively, with respect to (3j. An illustration of 
the nature of these estimators is given in Figure 3. We note the following: (1) All estimators carry 
out group selection, in the sense that, for any value of A, the coefficients belonging to a group are 
either wholly included or wholly excluded from the model. (2) The group MCP and group SCAD 
methods eliminate some of the bias towards zero introduced by the group lasso. In particular, at 
A ~ 0.2, they produce the same estimates as a least squares regression model including only the 
nonzero covariates (the "oracle" model). (3) Group MCP makes a smoother transition from to 
the unpenalized solutions than group SCAD. This is the "minimax" aspect of the penalty. Any 
other penalty that makes the transition between these two extremes must have some region (e.g. 
A € [0.7,0.5] for group SCAD) in which its solutions are changing more abruptly than those of 
group MCP. 

It is straightforward to extend Algorithm 1 to fit Group SCAD and Group MCP models; all 
that is needed is to replace the multivariate soft-thresholding update with a multivariate firm- 
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Figure 3: Representative solution paths for the group lasso, group MCP, and group SCAD methods. 
In the generating model, groups A and B have nonzero coefficients and while those belonging to 
group C are zero. 



thresholding update. The group updates for all three methods are listed below: 



Group lasso : (3j 
Group MCP : f3 j 
Group SCAD : f3 j 



5(z ij A) = 5(||z i ||,A)- 



F(z j ,X, 1 ) = F(\\z j \\,X^)j 
F(z J ,X, 7 ) = F s (\\z j \\,X, 7 ) ] 



Note that all three updates are simple, closed-form expressions. Furthermore, as each update 
minimizes the objective function with respect to (3j, the algorithms possess the descent property, 
meaning that they decrease the objective function with every iteration and are therefore guaranteed 
to converge, a fact which we now state formally. 

Proposition 1. Let denote the value of the fitted regression coefficient at the end of iteration 
m. At every iteration of the proposed group descent algorithms for linear regression models involving 
group lasso, group MCP, or group SCAD penalties, 

Q(p(™+1)) < Q(/3( m )). 

Furthermore, the sequence {(3^ l \(3^\ . . .} is guaranteed to converge to a minimum of Q. 

For the group lasso penalty, the objective function being minimized is convex, and thus the 
above proposition establishes convergence to the global minimum. For group MCP and group 
SCAD, whose objective function is a sum of convex and nonconvex components, convergence to a 
local minimum is possible. 

In conclusion, the algorithms we present here for fitting group lasso, group MCP, and group 
SCAD models are both fast and stable. We examine the empirical running time of these algorithms 
in Section 3. 
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2.4 Logistic regression 

It is possible to extend the algorithms described above to fit group-penalized logistic regression 
models as well, where the loss function is the negative log-likelihood of a binomial distribution: 

i i 

Recall, however, that the simple, closed form solutions of the previous sections were possible only 
with orthogonalization. The iteratively reweighted least squares (IRLS) algorithm typically used to 
fit generalized linear models (GLMs) introduces a n _1 X T WX term into the score equation (2.4), 
where W is an n x n diagonal matrix of weights. Because n _1 X T WX 7^ I, the group lasso, group 
MCP, and group SCAD solutions will lack the simple closed forms of the previous section. 

However, we may preserve the sphericity of the likelihood surface (Figure 1) through the ap- 
plication of a majorization- minimization (MM) approach (Lange et al., 2000; Hunter and Lange, 
2004). In the context of penalized logistic regression, this approach was proposed by Krishnapuram 
et al. (2005), who referred to it as a bound optimization algorithm. The application of the method 
depends on the ability to bound the Hessian of the loss function with respect to the linear predictor 
rj. Let v = maxj sup r? {V 2 Lj(ry)}, so that vi — V 2 L(rj) is a positive semidefinite matrix at all points 
rj. For logistic regression, where 

vr l = p(y, = i| ??l ) = r |^-, 

we may easily obtain v = 1/4, since V 2 Lj(r/) = 7r(l — it). 

Bounding the loss function in this manner allows us to define 

Li-qW) = L(rf) + (V~ V*) T VL(rj*) + H (t? _ V *) T ( V - rj*) 

such that the function L(?7|r7*) has the following two properties: 

L( V *\rj*) = L(rf) 
L( V \ V *) > L(rj). 

Thus, L(r)\r]*) is a majorizing function of L(r]). The theory underlying MM algorithms then ensures 
that Algorithm 2, which consists of alternately executing the majorizing step and the minimization 
steps, will retain the descent property of the previous sections, which we formally state below. 

Proposition 2. Let (3^ denote the value of the fitted regression coefficient at the end of iteration 
m. At every iteration of the proposed group descent algorithms for linear or logistic regression 
models involving group lasso, group MCP, or group SCAD penalties, 

q^^+i)) < Q(/3M). 

Furthermore, provided that Q((5) does not tend to —00, the sequence {(3^,(3^, . . .} is guaranteed 
to converge to a stationary point of Q. 
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For the group lasso penalty, the objective function being minimized is convex, and thus any 
stationary point is a global minimum. For group MCP and group SCAD, convergence to some 
other type of stationary point such as a saddle point cannot be ruled out, but is not a practical 
concern. 

A greater practical concern is the fact that, unlike linear regression, the loss function for logistic 
regression is not bounded below and can go to — oo. In practice, this is not a concern for large 
values of A, but saturation of the model will certainly occur when p > n and A is small. Our 
implementation in grpreg terminates the path-fitting algorithm if saturation is detected, based on 
a check of whether > 99% of the null deviance has been explained by the model. 

Algorithm 2 Group descent algorithm for logistic regression with a group lasso penalty 
repeat 
Tj <- X/3 

^K7(i + ^)KLi 

f <- (y - ir)/v 
for j = l,2,..., J 
z j = X.Jr + (3 j 

until convergence 



Writing L(r]\rj*) in terms of /3, we have 

L(/3)«^(y-X/3) T (y-X/3), 

where y = r)* + (y — ir)/v is the pseudo-response vector. Thus, the gradient of L(r/\r}*) with respect 
to (3j is given by 

VL(p j ) = -vz j +v0 j , (2.12) 

where, as before, Zj = Xj(y — X_j/3_j-) is the vector of partial (pseudo-) residuals for (3j. 

The presence of the scalar v in the score equations affects the updating equations; however, as 
the majorized loss function remains spherical with respect to (3, the updating equations still have 
simple, closed form solutions: 

1 1 z 

Group lasso : /3 - < S(vzj,X) = -S(v ||z 7 -|| , A) - 

v v \\ z j\\ 

Group MCP : (3 j <- -F(vZj, A, 7) = -F(v ||z,-|| , A, 7)77^7 

V V ||Zj|| 

Group SCAD : (3, <- -F(vzj, A, 7) = -F s (v ||z,-|| , A, 7)77^77 

V V ||Zj|| 

Algorithm 2 is presented for the group lasso, but is easily modified to fit group MCP and group 
SCAD models by substituting the appropriate expression into the updating step for f3j. 

This approach unfortunately cannot be directly extended to other generalized linear models, as 
the Hessian matrices for other exponential families are typically unbounded. One possibility is to 
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set v maxj{V 2 Lj(r7*)} at the beginning of each iteration as a pseudo-upper bound. As this is 
not an actual upper bound, such an algorithm is not guaranteed to possess the descent property 
established above. It may, however, prove adequate in practice, although the authors have not 
examined this proposal. 

2.5 Path- fitting algorithm 

The above algorithms are presented from the perspective of fitting a penalized regression model 
for a single value of A. Usually, we are interested in obtaining (3 for a range of A values, and then 
choosing among those models using either cross-validation or some form of information criterion. 
The regularization parameter A may vary from a maximum value A max at which all penalized 
coefficients are down to A = or to a minimum value A m i n beyond which the model becomes 
excessively large. When the objective function is strictly convex, the estimated coefficients vary 
continuously with A € [A m i n , A max ] and produce a path of solutions regularized by A. An examples 
of such a path may be seen in Figure 3. 

Algorithms 1 and 2 are iterative and require initial values; the fact that (3 = at A max provides 
an efficient approach to choosing those initial values. Group lasso, group MCP, and group SCAD 
all have the same value of A max ; namely, A max = max.,-{||z.,-||} for linear regression or A max = 
maxj{f \\zj ||} for logistic regression, where the {zj} are computed with respect to the intercept- 
only model (or, if unpenalized covariates are present, with respect to the residuals of the fitted 
model including all of the unpenalized covariates). Thus, by starting at A max where (3 = is 
available in closed form and proceeding towards A m i n , using (3 from the previous value of A as the 
initial value for the next value of A, we can ensure that the initial values will never be far from the 
solution, a helpful property often referred to as "warm starts" in the path-fitting literature. 

3 Algorithm efficiency 

Here, we briefly comment on the efficiency of the proposed algorithms. Regardless of the penalty 
chosen, the most computationally intensive steps in Algorithm 1 are the calculation of the inner 
products Xjr and Xj(/3j — (3j), each of which requires 0(nKj) operations. Thus, one full pass 
over all the groups requires 0(2np) operations. The fact that this approach scales linearly in p 
allows it to be efficiently applied to high-dimensional problems. 

Of course, the entire time required to fit the model depends on the number of iterations, which 
in turn depends on the data and on A. Broadly speaking, the dominant factor influencing the 
number of iterations is the number of nonzero groups at that value of A, since no iteration is 
required to solve for groups that remain fixed at zero. Consequently, when fitting a regularized 
path, a disproportionate amount of time is spent at the least sparse end of the path, where A is 
small. 

Table 1 compares our implementation (grpreg) with two other publicly available R packages 
for fitting group lasso models over increasingly large data sets: the grplasso package (Meier 
et al., 2008) and the standGL package (Simon and Tibshirani, 2011). We note that (a) the grpreg 
implementation appears uniformly more efficient than the others, and (b) that group MCP and 
group SCAD tend to be slightly faster than group lasso. Presumably, this is caused by the fact 
that their solution paths tend to be more sparse. 

It is worth noting that all three of these packages can handle p > n problems; however, for the 
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Table 1: Comparison of (grpreg) with other publicly available group lasso packages. Times are 
in seconds required to fit the entire solution path over a grid of 100 A values, averaged over 100 
independent data sets. Each group consisted of 10 variables; thus p ranges over {10,100,1000} 
across the columns. 



n=50 n=500 n=5000 









J=l 


J=10 


J=100 


Linear 


grpreg 


Group lasso 


0.01 


0.07 


18.42 


regression 


grpreg 


Group MCP 


0.01 


0.05 


10.66 




grpreg 


Group SCAD 


0.01 


0.05 


11.04 




grplasso 


Group lasso 


0.16 


1.91 


60.96 




standGL 


Group lasso 


0.04 


1.27 


150.99 


Logistic 


grpreg 


Group lasso 


0.01 


0.21 


107.90 


regression 


grpreg 


Group MCP 


0.01 


0.13 


46.26 




grpreg 


Group SCAD 


0.01 


0.14 


83.08 




grplasso 


Group lasso 


1.20 


5.15 


203.22 




standGL 


Group lasso 


0.43 


6.40 


399.42 



purposes of timing, we chose to restrict our attention to problems in which the entire path can 
be computed. Otherwise, different implementations may terminate the fitting process at different 
points along the path, which would prevent a fair comparison of computing times. 

4 Simulation studies 

In this section, we compare the performance of group lasso, group MCP, and group SCAD 
using simulated data. First, a relatively basic setting is used to illustrate the primary advantages 
of group MCP and group SCAD over group lasso. We then attempt to mimic two settings in which 
the methodology might be used: to allow flexible semiparametric modeling of continuous variables 
and in genetic association studies, which involve large numbers of categorical variables. We use the 
term "null predictor" to refer to a covariate whose associated regression coefficient is zero in the 
true model. 

In all of the studies, five-fold cross-validation was used to choose the regularization parameter 
A. Group SCAD and group MCP have an additional tuning parameter, 7. In principle, one may 
attempt to select optimal values of 7 using, for example, cross-validation over a two-dimensional 
grid or using an information criterion approach. Here, we fix 7 = 3 for group MCP and 7 = 4 for 
group SCAD, roughly in line with the default recommendations suggested in Fan and Li (2001) 
and Zhang (2010) in the non-grouped case. 

In Section 4.1, we evaluate model accuracy by root mean squared error (RMSE): 

RMSE = ll^Wjk-M 2 . 
V i> fc 

In Sections 4.2 and 4.3, because the model fit to the data is not always the same as the generating 
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Figure 4: The impact of increasing coefficient magnitude on group regularization methods. Model 
size is given in terms of number of groups (i.e., the number of variables in the model is four times 
the amount shown). The faint gray line on the left is the theoretically optimal RMSE than can be 
achieved in this setting. The faint gray line on the right is the true model size. 



model, we focus on root model error (RME) instead: 



RME = 




where m and pb% denote the true and estimated mean of observation i given x, . Note that the model 
error, which is also discussed in Fan and Li (2001), is equal in expected value to the prediction error 
minus the irreducible error a 2 . In all simulations, results are averaged over 1,000 independently 
generated data sets. 



4.1 Basic 

We begin with a very straightforward study designed to illustrate the basic shortcomings of the 
group lasso in comparison with group MCP and group SCAD. The design matrix consists of 100 
groups, each with 4 elements. In five of these groups, the coefficients are ±/3; in the others, the 
true regression coefficients are zero. Covariate values and errors were generated from the standard 
normal distribution. We fixed the sample size at 100 (i.e., n=100, p=400) and varied \/3\. In 
principle, group lasso should struggle when \/3\ is large, as it cannot alleviate the problem of bias 
towards zero for large coefficients without lowering A and thereby allowing null predictors to enter 
the model. Indeed, as Figure 4 illustrates, this is exactly what occurs. 

For small values of the regression coefficients, all three group regularization methods perform 
similarly. As we increase the magnitude of these coefficients, however, group MCP and group 
SCAD begin to estimate j3 with an error approaching the theoretically optimal value, while group 
lasso performs increasingly poorly. Furthermore, group MCP and group SCAD select much smaller 
models and approach the true model size much faster than group lasso, which selects far too many 
variables. 
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Comparing group MCP and group SCAD, the two are nearly identical in terms of estimation 
accuracy. However, group MCP selects a considerably more sparse model, and has better variable 
selection properties. Thus, although the two methods behave similarly in an asymptotic setting, 
group MCP seems to have somewhat better finite-sample properties. 

In the left panel of Figure 4, we include an "oracle" RMSE for reference; we now clarify what 
exactly we mean by this. An oracle model (one which knows in advance which coefficients are 
zero and which are nonzero) would be able to achieve a mean squared error of zero for the zero- 
coefficient variables and a total MSE of tr{(X T X) -1 } for the nonzero variables. In our simulation X 
was random, with E(XX T ) = I. Thus, the total MSE of the oracle model is approximately tr(Io/n) 
where Io is the identity matrix with dimension equal to that of the nonzero coefficients, with RMSE 
y/s/n, where s is the sparsity fraction (i.e., the fraction of coefficients that are nonzero). Note that 
in any finite sample, the columns of X will be correlated, so even the oracle model cannot achieve 
this RMSE for finite sample sizes; the gray line in Figure 4 is thus the optimal RMSE that could 
be theoretically be achieved in this setting. 



4.2 Semiparametric regression 

Our next simulation involves groups of covariates constructed by taking basis expansions of 
continuous variables to allow for flexible covariate effects in semiparametric modeling. The sample 
size was 200 and the data consisted of 100 variables, each of which were generated as independent 
uniform (0,1) variates. The first six variables had potentially nonlinear effects given by the following 
equations: 

f^x) = 2(e~ 10x - e- 10 )/(l - e' 10 ) - 1 
f 2 (x) = -2(e~ Wx - e- 10 )/(l - e' 10 ) + 1 
f 3 (x) = 2x-l 
f 4 (x) = -2x + l 
f 5 (x) = 8(x-0.5) 2 -l 
f 6 (x) = -8(x - 0.5) 2 + 1; 

the other 94 had no effect on the outcome. The scaling of the functions is to ensure that each 
variable attains a minimum and maximum of ( — 1,1) over the domain of x and thus that the 
effects of all six variables are roughly comparable. Visually the effect of the six nonzero variables 
is illustrated below: 




Xi x 2 x 3 x 4 x 5 x 6 



For model fitting, each variable was represented using a 6-term B-spline basis expansion (i.e., X 
had dimensions n = 200, p = 600). In addition to the three group selection methods, models were 
also fit using the lasso (i.e., ignoring grouping). Results are given in Table 2. 

Certainly, all three group selection approaches greatly outperform the lasso here. However, as in 
the previous section, group MCP and group SCAD are able to achieve superior prediction accuracy 
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Table 2: Prediction and variable selection accuracy for the semiparametric regression simulation. 







Group 


Group 


Group 




Lasso 


Lasso 


MCP 


SCAD 


RME 


0.73 


0.59 


0.50 


0.52 


Variables selected 


31.5 


29.3 


10.4 


23.1 



which selecting more parsimonious models. Also as in the previous simulation, group MCP and 
group SCAD perform similarly as far as prediction accuracy, but group MCP is seen to have better 
finite-sample variable selection properties — recall that the true number of variables in the model 
is only 6. 

4.3 Genetic association study 

Finally, we carry out a simulation designed to mimic a small genetic association study involving 
single nucleotide polymorphisms (SNPs). Briefly, a SNP is a point on the genome in which multiple 
versions (alleles) may be present. A SNP may take on three values {0,1,2}, depending on the 
number of minor alleles present. The effect is not necessarily linear — for example, if the allele has 
a recessive effect, the phenotype associated with x = and x = 1 are identical, while the phenotype 
associated with x = 2 is different. In such studies, it is desirable to have a method which is robust 
to different mechanisms of action, yet powerful enough to actually detect important SNPs, as the 
number of SNPs is typically rather large. 

We simulated data involving 250 subjects and 500 SNPs, each of which was represented with 2 
indicator functions (i.e., n=250, p= 1,000). Three of the variables had an effect on the phenotype 
(one dominant effect, one recessive effect, one additive effect); the other 497 did not. In addition 
to the three group selection methods, we included for comparison two versions of the lasso: one 
applied to all p = 1, 000 variables and ignoring grouping, the other assuming an additive effect for 
each genotype. Note that for the second approach, p = 500 as we estimate only a single coefficient 
for each SNP. The results of this simulation are given in Table 3. 

Table 3: Prediction and variable selection accuracy for the genetic association study simulation. 
True discoveries are selected variables that have a truly nonzero effect; false discoveries are selected 
SNPs that have no effect on the phenotype. 







Additive 


Group 


Group 


Group 




Lasso 


Lasso 


Lasso 


MCP 


SCAD 


RME 


0.38 


0.37 


0.34 


0.25 


0.27 


True discoveries 


2.6 


2.2 


2.6 


2.4 


2.5 


False discoveries 


21.1 


16.3 


15.5 


4.0 


12.5 



The same broad conclusions may be reached here as in the previous simulations. In particular, 
we note that (1) The group selection methods outperform the variable selection methods that 
either do not account for grouping or that attempt to incorporate grouping in an ad-hoc fashion. 
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(2) Group MCP and group SCAD outperform the group lasso both in terms of prediction accuracy 
as well as the number of false discoveries. (3) Although group MCP and group SCAD are similar 
in terms of prediction accuracy, group MCP has significantly better variable selection properties, 
producing only four false discoveries compared to group SCAD's 12.5. 

5 Real data 

We give two examples of applying grouped variable selection methods to real data. The first 
is a gene expression study in rats to determine genes associated with Bardet-Biedl syndrome. 
The second is a genetic association study to determine SNPs associated with age-related macular 
degeneration. As in the previous section, we fix 7 = 3 for group MCP, 7 = 4 for group SCAD, and 
select A via cross-validation. 

5.1 Bardet-Biedl syndrome gene expression study 

The data we analyze here is discussed more fully in Scheetz et al. (2006). Briefly, the data 
set consists of normalized microarray gene expression data harvested from the eye tissue of 120 
twelve- week-old male rats. The outcome of interest is the expression of TRIM32, a gene which 
has been shown to cause Bardet-Biedl syndrome (Chiang et al., 2006). Bardet-Biedl syndrome is a 
genetic disease of multiple organ systems including the retina. 

Following the approach in Scheetz et al. (2006), 18,976 of the 31,042 probe sets on the array 
"exhibited sufficient signal for reliable analysis and at least 2-fold variation in expression." These 
probe sets include TRIM32 and 18,975 other genes that potentially influence its expression. We 
further restricted our attention to the 5,000 genes with the largest variances in expression (on the 
log scale) and considered a three-term natural cubic spline basis expansion of those genes, resulting 
in a grouped regression problem with n = 120 and p = 15, 000. The models selected by group lasso, 
group MCP, and group SCAD are described in Table 4. 

This is an interesting case study in that group MCP selects a very different model from the 
other two approaches. In particular, group lasso and group SCAD each select a fairly large number 
of genes, while shrinking each gene's group of coefficients nearly to zero. Group MCP, on the other 
hand, selects a single gene and returns a fit nearly the same as the least-squares fit for that gene 
alone. The relationship between probe set 1372928_at and TRIM32 estimated by each model is 
plotted in Figure 5. 

The most important aspect of Figure 5 to note is that the outcome, TRIM32, has a large outlying 
value almost 10 standard deviations below the mean of the rest of the points. This observation 
has a large impact on the fit: for all of the genes in Table 4, this subject is also responsible for the 
lowest/highest or second- lowest /highest value of that gene, and the median absolute correlation 
for the genes in the table is 0.62. Many of the scatterplots of those genes versus TRIM32 look 
qualitatively similar to the one in Figure 5. 

Faced with this set of genes, group MCP selects a single gene and fits a model that explains 
65% of the variance in TRIM32 expression. Group SCAD and group lasso select an ensemble of 
correlated genes, downweighting the contribution of each gene considerably. Each approach has 
advantages, depending on the goal of the analysis. The group SCAD/lasso approaches avoid a 
possibly arbitrary selection of one gene from a highly correlated set, and produce a model with 
somewhat better predictive ability (cross-validation error of 0.0085 ± 0.001 versus 0.0098 ± 0.002), 
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Table 4: Genes selected by group lasso/SCAD/MCP, along with the Euclidean norm of the coeffi- 
cients for each gene's basis expansion. 



Group norm 



Probe 


Gene 


Group 


Group 


Group 


Set 


Symbol 


Lasso 


MCP 


SCAD 


1374131_at 




0.11 




0.13 


1383110_at 


Klhl24 


0.10 




0.08 


1383749_at 


Phosphol 


0.02 




0.04 


1376267_at 




0.22 




0.23 


1377791_at 




0.13 




0.12 


1376747_at 




0.28 




0.24 


1390539_at 




0.11 




0.12 


1384470_at 




0.05 




0.07 


1386032_at 


Prkd3 


0.03 




0.06 


1393231_at 


Ppp4r2 


0.01 




0.03 


1385798_at 




0.03 






1383730_at 


Ttc9c 


0.02 




0.06 


1368476_at 


Nr3c2 


0.05 




0.01 


1384860_at 


Zfp84 


0.03 




0.07 


1372928_at 




0.22 


1.83 


0.20 


1381902_at 


Zfp292 


0.16 




0.18 


1390574_at 




0.02 




0.01 


1384940_at 


Zfp518a 


0.10 




0.10 



although all three approaches are within random variability of each other. Group MCP, on the 
other hand, produces a highly parsimonious model capable of predicting just as well as the group 
lasso/SCAD models despite using only a single gene. This is potentially a valuable property if the 
goal is, say, to develop a diagnostic assay and each gene that need to be measured adds cost to the 
assay. 

Finally, it should be noted that although group MCP is behaving in a rather greedy fashion in 
this example, this is not an inherent aspect of the method. By adjusting the 7 parameter, group 
MCP can be made to resemble the group lasso and group SCAD solutions as well - recall that 
group lasso can be considered a special case of group MCP with 7 = 00. Group SCAD, on the 
other hand, cannot be made to resemble group MCP and is incapable in this case of selecting a 
highly parsimonious model. Group MCP is considerably more flexible, although of course to pursue 
this flexibility, proper selection of tuning parameters becomes an important issue. The selection 
of additional tuning parameters is an important area for further study in both the grouped and 
non-grouped application of the MC penalty. 

5.2 Age-related macular degeneration genetic association study 

We analyze data from a case-control study of age-related macular degeneration consisting of 
400 cases and 400 controls. We confine our analysis to 532 SNPs that previous biological studies 
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6.5 7.0 7.5 8.0 8.5 

1372928_at 

Figure 5: Estimated relationship between probe set 1372928_at and TRIM32 estimated by group 
lasso, group MCP, and group SCAD. Estimates are superimposed on top of a scatterplot and 
restricted to pass through the mean expression for each probe set. 

have suggested may be related to the disease. As in Section 4.3, we represent each SNP as a three- 
level factor depending on the number of minor alleles the individual carries at that locus. This 
requires the expansion of each SNP into a group of two indicator functions; our design matrix for 
this example thus has n = 800 and p = 1,064. Group regularized logistic regression models were 
then fit to the data; as before, A was selected via cross-validation. The results are presented in 
Table 5. 

Table 5: Application of group regularized methods to age-related macular degeneration study. The 
number of SNPs selected by the method, cross-validation error, and associated standard error are 
reported. For comparison, the intercept-only model is also listed ("Baseline"). 





SNPs 


CV Error 


SE 


Baseline 




0.250 


0.0002 


grLasso 


45 


0.237 


0.004 


grMCP 


13 


0.237 


0.004 


grSCAD 


35 


0.237 


0.004 



Although the three group regularization approaches are equivalent in terms of out-of-sample 
prediction accuracy in this example, it is worth noting the following: (1) All three approaches 
represent significant improvements over the baseline (intercept-only) cross-validation error, and 
(2) again, the group MCP approach offers a considerably more parsimonious model with no loss in 
prediction accuracy. Compared with the gene expression example of the previous section, parsimony 
is both more desirable and more believable in this example. Unlike the gene expression case, the 
SNPs selected here by the group lasso are not highly correlated (median absolute correlation is 
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only 0.03) and thus more likely to represent independent causes than dependent manifestations of 
a separate underlying cause such as up-regulation of a pathway. 



6 Conclusion 

Group MCP and group SCAD models are powerful alternatives to the group lasso in problems 
involving grouped variable selection. However, application and study of these approaches has been 
limited, especially in high-dimensional problems, due to a lack of efficient algorithms and a lack 
of publicly available software for fitting these models. In this article, we attempt to remedy this, 
describing the development of efficient algorithms and proving an implementation via the R package 
grpreg. 

Appendix 

Before proving Proposition 1, we establish the groupwise convexity of all the objective functions 
under consideration. Note that for group SCAD and group MCP, although they contain nonconvex 
components and are not necessarily convex overall, the objective functions are still convex with 
respect to the variables in a single group. 

Lemma 1. The objective function Q(/3j) for regularized linear regression is a strictly convex func- 
tion with respect to (3j for the group lasso, for group SCAD with 7 > 2, and for group MCP with 

7 > 1. 

Proof. Although Q((3j) is not differentiable, it is directionally twice differentiable everywhere. Let 
V^Qiflj) denote the second derivative of Q((3j) in the direction d. Then the strict convexity of 
Q((3j) follows if V\Q(j3j) is positive definite at all j3j and for all d. Let denote the infimum 
over (3j and d of the minimum eigenvalue of V\Q{(5j). Then, after some algebra, we obtain 

£* = 1 Group lasso 

£* = 1 — Group SCAD 

7-1 

£* = 1 - - Group MCP, 

7 

These quantities are positive under the conditions specified in the lemma. □ 

We now proceed to the proof of Proposition 1. 

Proof of Proposition 1. The descent property is a direct consequence of the fact that each updating 
step consists of minimizing Q(j3) with respect to (3j. Lemma 1, along with the fact that the loss 
function is continuously differentiable, provide sufficient conditions to apply Proposition 5.1 of 
Tseng (2001), thereby establishing convergence to a minimum of Q((3). □ 

For Proposition 2, involving logistic regression, we proceed similarly, letting R(f3\j3) denote the 
majorizing approximation to Q{f3) at 0. 
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Lemma 2. The majorizing approximation R(j3j\f3) for regularized logistic regression is a strictly 
convex function with respect to (3j at all 13 for the group lasso, for group SCAD with 7 > 8, and 
for group MCP with 7 > 4. 

Proof. Proceeding as in the previous lemma, and letting denote the infimum over (3, @j and d 
of the minimum eigenvalue of V\Q{f3j), we obtain 

£* = ^ Group lasso 

£>, = — Group SCAD 

4 7 — 1 

£* = \ - - Group MCP, 

4 7 

These quantities are positive under the conditions specified in the lemma. □ 

Proof of Proposition 2. The proposition makes two claims: descent with every iteration and guar- 
anteed convergence to a stationary point. To establish descent for logistic regression, we note that 
because L is twice differentiate, for any point r\ there exists a vector rf* on the line segment 
joining r\ and rf such that 

L(ri) = L( V *) + (r, - r,*) T VL(r,*) + \{r 1 - r,*) T V 2 L{r,**){r, - r,*) 

< L(V\V*) 

where the inequality follows from the fact that vl — V 2 L(r]**) is a positive semidefmite matrix. 
Descent now follows from the descent property of MM algorithms (Lange et al., 2000) coupled with 
the fact that each updating step consists of minimizing R((3j\f3). 

To establish convergence to a stationary point, we note that if Q{j3) does not go to —00, 
then the descent property of the algorithm ensures that the sequence (3^ stays within a compact 
set and therefore possesses a limit point f3. Then, as in the proof of Proposition 1, Lemma 2 
allows us to apply the results of Tseng (2001) and conclude that (3 must be a stationary point of 
R{(3\(3). Furthermore, because R(f3\(3) is tangent to Q((3) at /3, (3 must also be a stationary point 
ofQ(/3). □ 
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